{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "synthetic_features_and_outliers.ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": [
        "i5Ul3zf5QYvW",
        "jByCP8hDRZmM",
        "WvgxW0bUSC-c",
        "copyright-notice"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "metadata": {
        "id": "copyright-notice",
        "colab_type": "text"
      },
      "source": [
        "#### Copyright 2017 Google LLC."
      ],
      "cell_type": "markdown"
    },
    {
      "outputs": [],
      "execution_count": 0,
      "metadata": {
        "id": "copyright-notice2",
        "colab_type": "code",
        "cellView": "both",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "cell_type": "code"
    },
    {
      "metadata": {
        "id": "4f3CKqFUqL2-",
        "colab_type": "text",
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": [
        " # Atributos sint\u00e9ticas y valores at\u00edpicos"
      ]
    },
    {
      "metadata": {
        "id": "jnKgkN5fHbGy",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " **Objetivos de aprendizaje:**\n",
        "  * crear un atributo sint\u00e9tico que sea la proporci\u00f3n de otros dos atributos\n",
        "  * usar este atributo nuevo como una entrada para un modelo de regresi\u00f3n lineal\n",
        "  * mejorar la eficacia del modelo al identificar valores at\u00edpicos y aplicarles un l\u00edmite (quitarlos) de los datos de entrada"
      ]
    },
    {
      "metadata": {
        "id": "VOpLo5dcHbG0",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Revisemos nuestro modelo del ejercicio anterior \"Primeros pasos con TensorFlow\". \n",
        "\n",
        "Primero, importaremos los datos de viviendas en California a un `DataFrame` de *Pandas*:"
      ]
    },
    {
      "metadata": {
        "id": "S8gm6BpqRRuh",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Preparaci\u00f3n"
      ]
    },
    {
      "metadata": {
        "id": "9D8GgUovHbG0",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "from __future__ import print_function\n",
        "\n",
        "import math\n",
        "\n",
        "from IPython import display\n",
        "from matplotlib import cm\n",
        "from matplotlib import gridspec\n",
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import sklearn.metrics as metrics\n",
        "import tensorflow as tf\n",
        "from tensorflow.python.data import Dataset\n",
        "\n",
        "tf.logging.set_verbosity(tf.logging.ERROR)\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = '{:.1f}'.format\n",
        "\n",
        "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
        "\n",
        "california_housing_dataframe = california_housing_dataframe.reindex(\n",
        "    np.random.permutation(california_housing_dataframe.index))\n",
        "california_housing_dataframe[\"median_house_value\"] /= 1000.0\n",
        "california_housing_dataframe"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "I6kNgrwCO_ms",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " A continuaci\u00f3n, configuraremos nuestra funci\u00f3n de entrada y definiremos la funci\u00f3n para el entrenamiento del modelo:"
      ]
    },
    {
      "metadata": {
        "id": "5RpTJER9XDub",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):\n",
        "    \"\"\"Trains a linear regression model of one feature.\n",
        "  \n",
        "    Args:\n",
        "      features: pandas DataFrame of features\n",
        "      targets: pandas DataFrame of targets\n",
        "      batch_size: Size of batches to be passed to the model\n",
        "      shuffle: True or False. Whether to shuffle the data.\n",
        "      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely\n",
        "    Returns:\n",
        "      Tuple of (features, labels) for next data batch\n",
        "    \"\"\"\n",
        "    \n",
        "    # Convert pandas data into a dict of np arrays.\n",
        "    features = {key:np.array(value) for key,value in dict(features).items()}                                           \n",
        " \n",
        "    # Construct a dataset, and configure batching/repeating.\n",
        "    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit\n",
        "    ds = ds.batch(batch_size).repeat(num_epochs)\n",
        "    \n",
        "    # Shuffle the data, if specified.\n",
        "    if shuffle:\n",
        "      ds = ds.shuffle(buffer_size=10000)\n",
        "    \n",
        "    # Return the next batch of data.\n",
        "    features, labels = ds.make_one_shot_iterator().get_next()\n",
        "    return features, labels"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "VgQPftrpHbG3",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def train_model(learning_rate, steps, batch_size, input_feature):\n",
        "  \"\"\"Trains a linear regression model.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    input_feature: A `string` specifying a column from `california_housing_dataframe`\n",
        "      to use as input feature.\n",
        "      \n",
        "  Returns:\n",
        "    A Pandas `DataFrame` containing targets and the corresponding predictions done\n",
        "    after training the model.\n",
        "  \"\"\"\n",
        "  \n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "\n",
        "  my_feature = input_feature\n",
        "  my_feature_data = california_housing_dataframe[[my_feature]].astype('float32')\n",
        "  my_label = \"median_house_value\"\n",
        "  targets = california_housing_dataframe[my_label].astype('float32')\n",
        "\n",
        "  # Create input functions.\n",
        "  training_input_fn = lambda: my_input_fn(my_feature_data, targets, batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)\n",
        "  \n",
        "  # Create feature columns.\n",
        "  feature_columns = [tf.feature_column.numeric_column(my_feature)]\n",
        "    \n",
        "  # Create a linear regressor object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_regressor = tf.estimator.LinearRegressor(\n",
        "      feature_columns=feature_columns,\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "\n",
        "  # Set up to plot the state of our model's line each period.\n",
        "  plt.figure(figsize=(15, 6))\n",
        "  plt.subplot(1, 2, 1)\n",
        "  plt.title(\"Learned Line by Period\")\n",
        "  plt.ylabel(my_label)\n",
        "  plt.xlabel(my_feature)\n",
        "  sample = california_housing_dataframe.sample(n=300)\n",
        "  plt.scatter(sample[my_feature], sample[my_label])\n",
        "  colors = [cm.coolwarm(x) for x in np.linspace(-1, 1, periods)]\n",
        "\n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"RMSE (on training data):\")\n",
        "  root_mean_squared_errors = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_regressor.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period,\n",
        "    )\n",
        "    # Take a break and compute predictions.\n",
        "    predictions = linear_regressor.predict(input_fn=predict_training_input_fn)\n",
        "    predictions = np.array([item['predictions'][0] for item in predictions])\n",
        "    \n",
        "    # Compute loss.\n",
        "    root_mean_squared_error = math.sqrt(\n",
        "      metrics.mean_squared_error(predictions, targets))\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, root_mean_squared_error))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    root_mean_squared_errors.append(root_mean_squared_error)\n",
        "    # Finally, track the weights and biases over time.\n",
        "    # Apply some math to ensure that the data and line are plotted neatly.\n",
        "    y_extents = np.array([0, sample[my_label].max()])\n",
        "    \n",
        "    weight = linear_regressor.get_variable_value('linear/linear_model/%s/weights' % input_feature)[0]\n",
        "    bias = linear_regressor.get_variable_value('linear/linear_model/bias_weights')\n",
        "    \n",
        "    x_extents = (y_extents - bias) / weight\n",
        "    x_extents = np.maximum(np.minimum(x_extents,\n",
        "                                      sample[my_feature].max()),\n",
        "                           sample[my_feature].min())\n",
        "    y_extents = weight * x_extents + bias\n",
        "    plt.plot(x_extents, y_extents, color=colors[period]) \n",
        "  print(\"Model training finished.\")\n",
        "\n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.subplot(1, 2, 2)\n",
        "  plt.ylabel('RMSE')\n",
        "  plt.xlabel('Periods')\n",
        "  plt.title(\"Root Mean Squared Error vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(root_mean_squared_errors)\n",
        "\n",
        "  # Create a table with calibration data.\n",
        "  calibration_data = pd.DataFrame()\n",
        "  calibration_data[\"predictions\"] = pd.Series(predictions)\n",
        "  calibration_data[\"targets\"] = pd.Series(targets)\n",
        "  display.display(calibration_data.describe())\n",
        "\n",
        "  print(\"Final RMSE (on training data): %0.2f\" % root_mean_squared_error)\n",
        "  \n",
        "  return calibration_data"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FJ6xUNVRm-do",
        "colab_type": "text",
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": [
        " ## Tarea\u00a01: Prueba un atributo sint\u00e9tico\n",
        "\n",
        "Los atributos `total_rooms` y `population` cuentan los totales para una manzana espec\u00edfica.\n",
        "\n",
        "Pero, \u00bfqu\u00e9 ocurrir\u00eda si una manzana estuviese m\u00e1s densamente poblada que otra? Podemos explorar c\u00f3mo se relaciona la densidad de las manzanas con el valor mediano de las casas al crear un atributo sint\u00e9tico que sea una proporci\u00f3n de `total_rooms` y `population`.\n",
        "\n",
        "En la celda a continuaci\u00f3n, crea un atributo denominado `rooms_per_person` y \u00fasalo como la `input_feature` para `train_model()`.\n",
        "\n",
        "\u00bfCu\u00e1l es el mejor rendimiento que puedes obtener con este \u00fanico atributo al ajustar la tasa de aprendizaje? (Cuanto mejor sea el rendimiento, mejor se ajustar\u00e1 la l\u00ednea de regresi\u00f3n a los datos y m\u00e1s baja ser\u00e1 la RMSE final)."
      ]
    },
    {
      "metadata": {
        "id": "isONN2XK32Wo",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " **NOTA**: Es posible que te resulte \u00fatil agregar algunas celdas de c\u00f3digo a continuaci\u00f3n para probar varias tasas de aprendizaje diferentes y comparar los resultados. Para agregar una celda de c\u00f3digo nueva, desplaza el cursor directamente debajo del centro de esta celda y haz clic en **CODE** (C\u00d3DIGO)."
      ]
    },
    {
      "metadata": {
        "id": "5ihcVutnnu1D",
        "colab_type": "code",
        "slideshow": {
          "slide_type": "slide"
        },
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          },
          "test": {
            "output": "ignore",
            "timeout": 600
          }
        },
        "cellView": "both"
      },
      "source": [
        "#\n",
        "# YOUR CODE HERE\n",
        "#\n",
        "california_housing_dataframe[\"rooms_per_person\"] =\n",
        "\n",
        "calibration_data = train_model(\n",
        "    learning_rate=0.00005,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    input_feature=\"rooms_per_person\"\n",
        ")"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "i5Ul3zf5QYvW",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "metadata": {
        "id": "Leaz2oYMQcBf",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "california_housing_dataframe[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"total_rooms\"] / california_housing_dataframe[\"population\"])\n",
        "\n",
        "calibration_data = train_model(\n",
        "    learning_rate=0.05,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    input_feature=\"rooms_per_person\")"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "ZjQrZ8mcHFiU",
        "colab_type": "text",
        "slideshow": {
          "slide_type": "slide"
        }
      },
      "cell_type": "markdown",
      "source": [
        " ## Tarea\u00a02: Identifica valores at\u00edpicos\n",
        "\n",
        "Para visualizar el rendimiento del modelo, podemos crear una representaci\u00f3n de dispersi\u00f3n de las predicciones frente a los valores objetivo. Idealmente, estos deber\u00edan estar en una l\u00ednea diagonal perfectamente correlacionada.\n",
        "\n",
        "Usa `scatter()` de Pyplot para crear una representaci\u00f3n de dispersi\u00f3n de predicciones frente a valores objetivo con el modelo de habitaciones por persona que entrenaste en la Tarea\u00a01.\n",
        "\n",
        "\u00bfPuedes ver singularidades? Rastr\u00e9alas hasta los datos de origen al observar la distribuci\u00f3n de los valores en `rooms_per_person`."
      ]
    },
    {
      "metadata": {
        "id": "P0BDOec4HbG_",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# YOUR CODE HERE"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "jByCP8hDRZmM",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "metadata": {
        "id": "s0tiX2gdRe-S",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "plt.figure(figsize=(15, 6))\n",
        "plt.subplot(1, 2, 1)\n",
        "plt.scatter(calibration_data[\"predictions\"], calibration_data[\"targets\"])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "kMQD0Uq3RqTX",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Los datos de calibraci\u00f3n muestran la mayor\u00eda de los puntos de dispersi\u00f3n alineados con una l\u00ednea. Esta es casi vertical, pero regresaremos a eso m\u00e1s adelante. Ahora concentr\u00e9monos en aquellas que se desv\u00edan de la l\u00ednea. Observamos que la cantidad es relativamente baja.\n",
        "\n",
        "Si representamos un histograma de `rooms_per_person`, detectamos que tenemos algunos valores at\u00edpicos en nuestros datos de entrada:"
      ]
    },
    {
      "metadata": {
        "id": "POTM8C_ER1Oc",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "plt.subplot(1, 2, 2)\n",
        "_ = california_housing_dataframe[\"rooms_per_person\"].hist()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "9l0KYpBQu8ed",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Tarea\u00a03: Ajusta los valores at\u00edpicos\n",
        "\n",
        "Comprueba si puedes mejorar a\u00fan m\u00e1s la preparaci\u00f3n del modelo al establecer los valores at\u00edpicos de `rooms_per_person` en un m\u00e1ximo o m\u00ednimo razonables.\n",
        "\n",
        "Como referencia, aqu\u00ed se incluye un ejemplo r\u00e1pido de c\u00f3mo aplicar una funci\u00f3n a una `Series` de Pandas:\n",
        "\n",
        "    clipped_feature = my_dataframe[\"my_feature_name\"].apply(lambda x: max(x, 0))\n",
        "\n",
        "La `clipped_feature` anterior no tendr\u00e1 valores inferiores a `0`."
      ]
    },
    {
      "metadata": {
        "id": "rGxjRoYlHbHC",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# YOUR CODE HERE"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "WvgxW0bUSC-c",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer la soluci\u00f3n."
      ]
    },
    {
      "metadata": {
        "id": "8YGNjXPaSMPV",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " El histograma que creamos en la Tarea\u00a02 muestra que la mayor\u00eda de los valores son inferiores a `5`. Ajustemos `rooms_per_person` en 5 y representemos un histograma para volver a comprobar los resultados."
      ]
    },
    {
      "metadata": {
        "id": "9YyARz6gSR7Q",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "california_housing_dataframe[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"rooms_per_person\"]).apply(lambda x: min(x, 5))\n",
        "\n",
        "_ = california_housing_dataframe[\"rooms_per_person\"].hist()"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "vO0e1p_aSgKA",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Para verificar si funcion\u00f3 el l\u00edmite aplicado, volvamos a realizar el entrenamiento e imprimamos los datos de calibraci\u00f3n una vez m\u00e1s:"
      ]
    },
    {
      "metadata": {
        "id": "ZgSP2HKfSoOH",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "calibration_data = train_model(\n",
        "    learning_rate=0.05,\n",
        "    steps=500,\n",
        "    batch_size=5,\n",
        "    input_feature=\"rooms_per_person\")"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "gySE-UgfSony",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "_ = plt.scatter(calibration_data[\"predictions\"], calibration_data[\"targets\"])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    }
  ]
}
